-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Support customised S3 servers endpoint URL #29050
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/s3.py
Outdated
# the format: http(s)://{host}:{port} | ||
s3_endpoint = os.environ.get('S3_ENDPOINT') | ||
if s3_endpoint: | ||
client_kwargs = {'endpoint_url': s3_endpoint} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi xieqihui :) I believe there needs to be a type annotation, such as
client_kwargs: Optional[Dict[str, str]] = {"endpoint_url": s3_endpoint}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reminder. I have added the type annotation and format the code.
pandas/io/s3.py
Outdated
# Support customised S3 servers endpoint URL via environment variable | ||
# The S3_ENDPOINT should be the complete URL to S3 service following | ||
# the format: http(s)://{host}:{port} | ||
s3_endpoint = os.environ.get('S3_ENDPOINT') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the CI process checks that the code would be left unchanged by the black autoformatter.
Try running
black pandas/io/s3.py --diff
to see the changes (and leave out the --diff
flag to apply them)
Is S3_ENDPOINT a common variable in the S3 ecosystem? Should s3fs just be using it as the default value for Ideally, pandas has as little s3-specific code as possible. We'd like to delegate it to other libraries. |
Hi @TomAugspurger. In this commit, if
|
# the format: http(s)://{host}:{port}. If S3_ENDPOINT is undefined, it will | ||
# fallback to use the default AWS S3 endpoint as determined by boto3. | ||
s3_endpoint = os.environ.get("S3_ENDPOINT") | ||
client_kwargs: Optional[Dict[str, str]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't this be better as a feature directly to s3fs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, here comes an awkward situation: From the point view of s3fs and boto3, they already support customised endpoint by accepting parameters endpoint_url
via client_kwargs
, so there is no point to add environment variable support for this parameter there, as discussed on this boto3 issue.
The problem with Pandas now is that it doesn't provide a way to pass this parameter over to s3fs. I think adding support using this environment variable will benefit many people using their own s3 servers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't using the environment variable as the default endpoint_url go in s3fs, to benefit even more people?
Agreed with Jeff. I would prefer to see this in s3fs. |
closing as out of scope for pandas, encourage to submit to s3fs. |
Would appreciate is people on this thread would comment in fsspec/s3fs#273 as to how they would expect users to codify such arguments. I suppose environment variables is certainly an option, but passing arguments would be preferable. The linked issue talks about embedding these arguments in the URL itself. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Add support for customised S3 servers via checking the environment variable
S3_ENDPOINT
.If set, the value of
S3_ENDPOINT
will be passed to theendpoint_url
parameter forboto3.Session
. It should be the complete URL to S3 service following the format: http(s)://{host}:{port}.This feature is useful for companies who use their own S3 servers like MinIO, Ceph etc, as mentioned in issue #26195